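In data processing, the overview methods from the past few days give you the big picture of a dataset, but you still need to dig in and check what the actual rows look like. When the dataset is ordered, show() and limit() only return the first rows, and filter() only returns rows that match a condition you already had in mind, so none of them gives as representative a view as sample().
So let's take a quick look at how to use sample()!
Here we go!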
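To make the point about ordered data concrete, here is a minimal plain-Python sketch (no Spark involved). The list `ages` is a made-up, sorted dataset standing in for an ordered DataFrame, and `head` plays the role of what show() or limit() would let you see. It only illustrates the idea and is not how Spark works internally.

```python
import random

# Hypothetical ordered dataset: ages sorted ascending (an assumption for illustration).
ages = list(range(1, 101))

# What show()/limit() would let you see: only the first rows.
head = ages[:10]

# A seeded random sample of the same size, drawn from the whole dataset.
rng = random.Random(4)
sampled = rng.sample(ages, 10)


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


print("true mean  :", mean(ages))
print("head mean  :", mean(head))
print("sample mean:", mean(sampled))
```

The first ten rows of a sorted column all come from the low end, so their mean says little about the full dataset, while a random sample draws from the whole range.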
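sample()
sample(withReplacement, fraction, seed)
withReplacement – whether to sample with replacement. The default is False: each row can be picked at most once. With True, the same row can appear more than once in the result. Either way, sample() returns a new DataFrame and leaves the original data unchanged.
fraction – the fraction of rows to sample. Without replacement it is the probability of keeping each row, in the range [0.0, 1.0]; with replacement it is the expected number of times each row is picked. The result is not guaranteed to contain exactly that fraction of the rows.
seed – the random seed for sampling. Sampling is random, so just like generating random data, it takes a seed, and fixing the seed makes the result repeatable on the same data.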
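from pyspark.sql import SparkSession

# Reuse (or create) a SparkSession; sc is its SparkContext
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(
    [
        ("drink", 2, "Carmen", 23, "Female"),
        ("movie", 2, "Juliette", 16, "Female"),
        ("write", 2, "Don José", 25, "Male"),
        ("sleep", 2, "Escamillo", 30, "Male"),
        ("play", 2, "Roméo", 18, "Male"),
        ("swim", 3, "Vivi", 18, "Female"),
    ]
)
df = rdd.toDF(["Thing", "Hour", "Name", "Age", "Gender"])
df.show()
# Sample with replacement, fraction 0.5, fixed seed
df.sample(withReplacement=True, fraction=0.5, seed=4).show()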
'''
+---------+---+------------+Original Data+---------+---+------------+
df.show()
+-----+----+---------+---+------+
|Thing|Hour|     Name|Age|Gender|
+-----+----+---------+---+------+
|drink|   2|   Carmen| 23|Female|
|movie|   2| Juliette| 16|Female|
|write|   2| Don José| 25|  Male|
|sleep|   2|Escamillo| 30|  Male|
| play|   2|    Roméo| 18|  Male|
| swim|   3|     Vivi| 18|Female|
+-----+----+---------+---+------+
+---------+---+------------+Original Data+---------+---+------------+
+---------+---+------------+OUTPUT+---------+---+------------+
df.sample(withReplacement=True, fraction=0.5, seed=4).show()
+-----+----+--------+---+------+
|Thing|Hour|    Name|Age|Gender|
+-----+----+--------+---+------+
|movie|   2|Juliette| 16|Female|
|write|   2|Don José| 25|  Male|
+-----+----+--------+---+------+
+---------+---+------------+OUTPUT+---------+---+------------+
'''
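The output above has two rows out of six rather than exactly three, because fraction is a per-row probability and not an exact count. The following plain-Python sketch is a conceptual model of that behaviour, not Spark's actual code. The helper names `bernoulli_sample`, `poisson_draw`, and `replacement_sample` are made up for illustration, and the Poisson model for sampling with replacement reflects my understanding of how Spark handles that case.

```python
import math
import random


def bernoulli_sample(rows, fraction, seed):
    """Model of withReplacement=False: keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


def poisson_draw(rng, lam):
    """Draw one value from a Poisson(lam) distribution (Knuth's multiplication method)."""
    limit = math.exp(-lam)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= limit:
            return count
        count += 1


def replacement_sample(rows, fraction, seed):
    """Model of withReplacement=True: emit each row a Poisson(fraction) number of times."""
    rng = random.Random(seed)
    result = []
    for row in rows:
        result.extend([row] * poisson_draw(rng, fraction))
    return result


names = ["Carmen", "Juliette", "Don José", "Escamillo", "Roméo", "Vivi"]

# The same seed always gives the same rows; different seeds may give different sizes.
for seed in range(3):
    print(seed, bernoulli_sample(names, 0.5, seed), replacement_sample(names, 0.5, seed))
```

In this model a fraction of 0.5 only means that about half the rows are expected back on average. On a six-row dataset, getting two, three, or four rows are all ordinary outcomes.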
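sampleBy()
sampleBy(col, fractions, seed)
col – the column that defines the strata to sample by.
fractions – a dict that maps each stratum (a value in col) to its sampling fraction. If a stratum is not specified, its fraction is treated as zero.
seed – the random seed for sampling, with the same role as in sample().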
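from pyspark.sql.functions import col

# sc is an existing SparkContext, as in the sample() example
rdd = sc.parallelize(
    [
        ("drink", 2, "Carmen", 23, "Female"),
        ("movie", 2, "Juliette", 16, "Female"),
        ("write", 2, "Don José", 25, "Male"),
        ("sleep", 2, "Escamillo", 30, "Male"),
        ("play", 2, "Roméo", 18, "Male"),
        ("swim", 3, "Vivi", 18, "Female"),
        ("swim", 3, "Gary", 18, "Male"),
    ]
)
df = rdd.toDF(["Thing", "Hour", "Name", "Age", "Gender"])
df.show()
# Stratify by Hour: keep about half of the Hour == 2 rows and all of the Hour == 3 rows
df.sampleBy(col("Hour"), fractions={2: 0.5, 3: 1.0}, seed=1).show()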
'''
+---------+---+------------+Original Data+---------+---+------------+
df.show()
+-----+----+---------+---+------+
|Thing|Hour|     Name|Age|Gender|
+-----+----+---------+---+------+
|drink|   2|   Carmen| 23|Female|
|movie|   2| Juliette| 16|Female|
|write|   2| Don José| 25|  Male|
|sleep|   2|Escamillo| 30|  Male|
| play|   2|    Roméo| 18|  Male|
| swim|   3|     Vivi| 18|Female|
| swim|   3|     Gary| 18|  Male|
+-----+----+---------+---+------+
+---------+---+------------+Original Data+---------+---+------------+
+---------+---+------------+OUTPUT+---------+---+------------+
df.sampleBy(col("Hour"), fractions={2: 0.5, 3: 1}, seed=1).show()
+-----+----+---------+---+------+
|Thing|Hour|     Name|Age|Gender|
+-----+----+---------+---+------+
|write|   2| Don José| 25|  Male|
|sleep|   2|Escamillo| 30|  Male|
| swim|   3|     Vivi| 18|Female|
| swim|   3|     Gary| 18|  Male|
+-----+----+---------+---+------+
+---------+---+------------+OUTPUT+---------+---+------------+
'''
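In the output above, both Hour 3 rows survive (fraction 1) while only some of the Hour 2 rows do (fraction 0.5). A minimal plain-Python sketch of that per-stratum rule follows. It is a conceptual model, not Spark's implementation: `stratified_sample` is a made-up helper name, and the extra ("read", 4) row is a hypothetical addition to show what happens to a stratum that is missing from fractions.

```python
import random


def stratified_sample(rows, key, fractions, seed):
    """Keep each row with the probability assigned to its stratum (0.0 if unlisted)."""
    rng = random.Random(seed)
    kept = []
    for row in rows:
        fraction = fractions.get(key(row), 0.0)  # unspecified stratum -> never kept
        if rng.random() < fraction:
            kept.append(row)
    return kept


rows = [
    ("drink", 2), ("movie", 2), ("write", 2), ("sleep", 2), ("play", 2),
    ("swim", 3), ("swim", 3),
    ("read", 4),  # hypothetical row whose stratum is not listed in fractions
]

result = stratified_sample(rows, key=lambda r: r[1], fractions={2: 0.5, 3: 1.0}, seed=1)
print(result)
```

A stratum with fraction 1.0 is always kept in full, a stratum with fraction 0.5 keeps each row with probability one half, and a stratum that is not listed is dropped entirely.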
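If anything is unclear or wrong, or if you have another approach to share, feel free to leave me a comment! If you enjoyed this, likes and subscriptions are welcome too!
I'm Vivi, a data engineer struggling in the cloud! See you in the next article! Bye Bye~
[This article will also be published on my personal Medium. I look forward to meeting you there!]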